Distributed Hypertext Resource Discovery Through Examples
Authors
Abstract
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as “find the number of links from an environmental protection page to a page about oil and natural gas over the last year.” A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based “find similar” search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead, we exploit two properties: pages tend to cite pages on related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user’s interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM’s Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and to change crawling strategies dynamically. We report on experiments to establish that our system is efficient, effective, and robust.
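The interplay of crawler, classifier, and distiller described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy `WEB` graph, the keyword-ratio `relevance` function standing in for the classifier, and the link-counting `hub_scores` function standing in for the distiller. It uses an in-memory dictionary rather than a relational database.

```python
import heapq

# Hypothetical toy web: page id -> (page text, outgoing links).
WEB = {
    "seed": ("oil and gas industry overview", ["a", "b", "x"]),
    "a": ("natural gas pipelines and oil fields", ["c"]),
    "b": ("oil refinery links", ["c", "d"]),
    "x": ("celebrity gossip", ["y"]),
    "y": ("more gossip", []),
    "c": ("gas prices and oil exports", []),
    "d": ("oil drilling technology", []),
}
TOPIC_TERMS = {"oil", "gas"}  # assumed topic of interest


def relevance(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, budget):
    """Visit up to `budget` pages, following links of relevant pages first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        visited.append((url, score))
        for target in links:
            if target not in seen:
                seen.add(target)
                # An unvisited link inherits the relevance of the citing page.
                heapq.heappush(frontier, (-score, target))
    return visited


def hub_scores(visited, threshold=0.3):
    """Stand-in distiller: count each page's links to relevant crawled pages."""
    relevant = {url for url, score in visited if score >= threshold}
    return {url: sum(t in relevant for t in WEB[url][1]) for url, _ in visited}


visited = focused_crawl("seed", budget=6)
hubs = hub_scores(visited)
```

With a budget of six pages, the crawl reaches every on-topic page and never fetches `y`, whose only citing page scored zero. Pages such as `seed` and `b` receive the highest hub scores because they link to several relevant pages, which is the sense in which a page can serve as an access point to a relevant neighborhood.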
Similar resources
Big Data Resource Discovery Considering Semantics in Grid Environment
Nowadays, everybody talks about the famous phenomenon called ‘Big Data’. No one can escape this term particularly when we talk about large-scale distributed databases, i.e., data grid environment. Resource discovery (data source discovery) is an important step in the management, integration and querying of big data. The addressing protocol adopted for this discovery should respect not only the ...
Web Distributed Authoring and Versioning (WebDAV) Access Control Protocol
This document specifies a set of methods, headers, message bodies, properties, and reports that define Access Control extensions to the WebDAV Distributed Authoring Protocol. This protocol permits a client to read and modify access control lists that instruct a server whether to allow or deny operations upon a resource (such as HyperText Transfer Protocol (HTTP) method invocations) by a given p...
mRDP: An HTTP-based lightweight semantic discovery protocol
Discovery is one of the most important activities in ubiquitous and distributed computing, with a plethora of available protocols. Most of these protocols are designed for one concrete purpose: network nodes discovery, service discovery, search of specific information stored through the network, and so forth. Designing a single discovery system able to deal with the particularities of many diff...
Weighted-HR: An Improved Hierarchical Grid Resource Discovery
Grid computing environments include heterogeneous resources shared by a large number of computers to handle data-intensive and process-intensive applications. In these environments, the required resources must be accessible to Grid applications on demand, which makes resource discovery a critical service. In recent years, various techniques have been proposed to index and discover the Grid resource...
Trusted collaboration in distributed software development
FACULTY OF ENGINEERING, SCIENCE AND MATHEMATICS SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE Doctor of Philosophy by Ellis Rowland Watkins Distributed systems have moved from application-specific, bespoke and mutually incompatible network protocols to open standards based on TCP/IP, HTTP, and SGML, the foundations of the World Wide Web (WWW). The emergence of the WWW has brought about a revolution...